Marketing research on Olist E-Commerce¶

Midterm project proposal

1 Aims, objectives and introduction

1.1Introduction

Businesses have always made an effort to maintain their client base's interest in and satisfaction with the services they offer. They must include the most recent technical developments into their offerings if they want to remain competitive in the market. Back more than ten years ago, when the internet was still a relatively undeveloped technology, several industries tried to take advantage of its potential as a channel for customer communication between various enterprises.In this decade, industries have begun to offer services that are tailored to the specific requirements of each client. They must use artificial intelligence in order to provide these services.

E-commerce can be carried out using computers, tablets, smartphones, and other smart devices, so it operates in a wide range of categories. E-commerce transactions make almost every good or service conceivable approachable, including books, music, flight tickets, and financial services such share market and online banking.As a result it is seen as a very disruptive technology . The way people shop for and use products and services has evolved as a result of e-commerce. More and more consumers are now using their computers and other digital devices to make purchases for products that will be delivered right to their homes. As a result, it has revolutionized the retail business. Due to their immense popularity, Amazon and Alibaba have forced established retailers to change the way they conduct business.That's not all, though. Individual merchants have been participating in e-commerce transactions more frequently through their own personal websites, not to be outdone. Additionally, several buyers and sellers congregate on online marketplaces like eBay and Etsy to transact business.

I am conducting data-analytic research on the factors that could effect consumer satisfaction for e-commerce marketing. Is it the delayed orders? price of the item? or the user's location? Because there are numerous factors that influence how satisfied customers are when using e-commerce. Based on this study, we will determine what factors effect customers the most.For this analytic research I use python as the programming language and the jupiter notebook as the programming platform.

1.2 Importing the data

1.2.1 Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics
import datetime as dt
import requests
from bs4 import BeautifulSoup

1.2.1 Aquire the data sets

Mergring the dataset of olist_orders , olist_order_item , olist_order_review and olist_cutomers_dataset into one table and display. From this table I can see all the data in one shot in the table

In [2]:
orders = pd.read_csv("archive\olist_orders_dataset.csv")
payments = pd.read_csv("archive\olist_order_items_dataset.csv")
reviews = pd.read_csv("archive\olist_order_reviews_dataset.csv")
states = pd.read_csv("archive\olist_customers_dataset.csv")

order_merge = pd.merge(payments, reviews, on='order_id')
pre_merge = pd.merge(order_merge, orders, on='order_id')
merge = pd.merge(pre_merge, states, on='customer_id')

merge.drop(columns =["seller_id","product_id","order_item_id","review_comment_title","review_id", 
                     "review_comment_message", "review_creation_date","review_answer_timestamp"], 
                       inplace = True)

new_col = ['order_id','customer_id','price', 'freight_value','review_score','customer_state',
           'order_status','shipping_limit_date','order_purchase_timestamp', 'order_approved_at',
           'order_delivered_carrier_date','order_delivered_customer_date','order_estimated_delivery_date']

merge = merge[new_col]
merge.head(10)
Out[2]:
order_id customer_id price freight_value review_score customer_state order_status shipping_limit_date order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
0 00010242fe8c5a6d1ba2dd792cb16214 3ce436f183e68e07877b285a838db11a 58.90 13.29 5 RJ delivered 2017-09-19 09:45:35 2017-09-13 08:59:02 2017-09-13 09:45:35 2017-09-19 18:34:16 2017-09-20 23:43:48 2017-09-29 00:00:00
1 00018f77f2f0320c557190d7a144bdd3 f6dd3ec061db4e3987629fe6b26e5cce 239.90 19.93 4 SP delivered 2017-05-03 11:05:13 2017-04-26 10:53:06 2017-04-26 11:05:13 2017-05-04 14:35:00 2017-05-12 16:04:24 2017-05-15 00:00:00
2 000229ec398224ef6ca0657da4fc703e 6489ae5e4333f3693df5ad4372dab6d3 199.00 17.87 5 MG delivered 2018-01-18 14:48:30 2018-01-14 14:33:31 2018-01-14 14:48:30 2018-01-16 12:36:48 2018-01-22 13:19:16 2018-02-05 00:00:00
3 00024acbcdf0a6daa1e931b038114c75 d4eb9395c8c0431ee92fce09860c5a06 12.99 12.79 4 SP delivered 2018-08-15 10:10:18 2018-08-08 10:00:35 2018-08-08 10:10:18 2018-08-10 13:28:00 2018-08-14 13:32:39 2018-08-20 00:00:00
4 00042b26cf59d7ce69dfabb4e55b4fd9 58dbd0b2d70206bf40e62cd34e84d795 199.90 18.14 5 SP delivered 2017-02-13 13:57:51 2017-02-04 13:57:51 2017-02-04 14:10:13 2017-02-16 09:46:09 2017-03-01 16:42:31 2017-03-17 00:00:00
5 00048cc3ae777c65dbb7d2a0634bc1ea 816cbea969fe5b689b39cfc97a506742 21.90 12.69 4 MG delivered 2017-05-23 03:55:27 2017-05-15 21:42:34 2017-05-17 03:55:27 2017-05-17 11:05:55 2017-05-22 13:44:35 2017-06-06 00:00:00
6 00054e8431b9d7675808bcb819fb4a32 32e2e6ab09e778d99bf2e0ecd4898718 19.90 11.85 4 SP delivered 2017-12-14 12:10:31 2017-12-10 11:53:48 2017-12-10 12:10:31 2017-12-12 01:07:48 2017-12-18 22:03:38 2018-01-04 00:00:00
7 000576fe39319847cbb9d288c5617fa6 9ed5e522dd9dd85b4af4a077526d8117 810.00 70.75 5 SP delivered 2018-07-10 12:30:45 2018-07-04 12:08:27 2018-07-05 16:35:48 2018-07-05 12:15:00 2018-07-09 14:04:07 2018-07-25 00:00:00
8 0005a1a1728c9d785b8e2b08b904576c 16150771dfd4776261284213b89c304e 145.95 11.65 1 SP delivered 2018-03-26 18:31:29 2018-03-19 18:40:33 2018-03-20 18:35:21 2018-03-28 00:37:42 2018-03-29 18:17:31 2018-03-29 00:00:00
9 0005f50442cb953dcd1d21e1fb923495 351d3cb2cee3c7fd0af6616c82df21d3 53.99 11.40 4 SP delivered 2018-07-06 14:10:56 2018-07-02 13:59:39 2018-07-02 14:10:56 2018-07-03 14:25:00 2018-07-04 17:28:31 2018-07-23 00:00:00

1.3 Aims and obectives

For this project I would research the following:

1.3.1 The aims of the project:

  1. Deciding what dataset to Importing and merging of datasets
  2. Removing of Duplicates and Cleaning of Data to analyze the different data
  3. Exploratory Data Analysis to observe thr relationship between review score and other variables like price,Shipping cost, delivery delay etc.
  4. map visualisation for late orders and average review score by region
  5. Build decision tree classifier to classify customer satisfaction

1.3.2 objective for the project:

This report is to exploring customer satisfaction Based on price, freight value, time difference and whether the delivery is delayed or not. Data sets will be mergering to get a relevent data onbord and initiate the data processing and EDA on that to find out what's causes most of customer satistaction on the e-commerce.Furthermore I will engage in map plotting under map visualisation to check the p Orders average review score in the Brazil and also what is the order Delay percentage in Brazil. Use map plotting to check the reviews score is easier to visualie. In the latter part of this report I will build a model using random forest classifier and seperate UNI-VARIATE MODELS using decision tree classifier for price, freight value and time difference. Lastly I will build a multi- variate classification tree to overview How good this data for build predictive models. My main oblective of this model is not to build classification models but as a future work I build several tree models. I will mainly focuse on finding the relationship between customer satisfaction Based on price, freight value, time difference and whether the delivery is delayed or not and creating two maps which are average review score and oder delay precentage over the Brazil geo location area.

2. Data

2.1 Data sourse

This Brazilian e-commerce purchase order dataset was obtained from the Olist Store. The dataset contains details on 100k orders placed between 2016 and 2018 on several Brazilian marketplaces. Its capabilities enable viewing an order from a variety of angles, including customer location, product qualities, order status, pricing, payment, and freight performance. A geolocation dataset that links Brazilian zip codes to lat/lng coordinates was also made available.References to the corporations and partners in the review text have been swapped out for the names of the great houses from Game of Thrones. This is actual commercial data that has been abstracted.This dataset is public on kaggle.com. The Olist the largest department store in Brazilian marketplaces, generously published this data set for analysis purposes of researches and data sets can be downloaded as csv files.

2.2 Data source relevance

My major goal is to analyse whether the hypothesis that the price of the goods, shipping costs, customer location and on-time delivery are all important elements that affect how satisfied an online buyer is. The data source is more reliable because this dataset includes actual information from their purchase orders. Olist is the biggest Brazilian online market place, according to the world wide web. I discovered through data analysis that the olist's clients come from all over the Barazil. therefore, my data sample represents the whole Brazilian online customer population. Olist published this data from their it self; it wasn't obtained from a third party data source. The fact that this data collection contains only historical data is another reason why it is pertinent to my analysis. Real-time data analysis will not be a part of my analysis. This data integrates with my data pipeline as pandas dataframe. Using pandas, I created a straightforward data pipeline.

2.3 Data model

Data include information about customers, products, orders, and customer reviews. Therefore, this data is sufficient for the purposes of my research analysis.To facilitate understanding and organizing, the original data is split into several datasets.

image.png

Source : https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce?select=olist_customers_dataset.csv

As online buyers normally they will see reviews and compare price comparisons and check the delivery fees and the time of arrival. Those will affect online buyers’ decision to buy the product on the platform. I can gather the information with respect to above criteria from the selected Olist data sets. To overview the whole result of the review I used map visualization to check the average review score on the region this technique it easier to overview the overall rating of the region. This is easier to get the overall data and compare it. So this data setas contain geo graphical data that will make easy my requirement.

2.4 Alternative data sources

2.4.1 Web Scraping

Web scraping allows us to get data. Web scraping is a computerized technique for gathering massive volumes of data from websites. The lot of this data is unstructured in HTML format and is converted into structured data in a database or spreadsheet so that it can be used in multiple applications. To gather data from websites, web scraping can really be carried out using a variety of approaches. These include utilising specific APIs, web services, or even developing your own code entirely from scratch for web scraping. You may access the structured data on many huge websites, including Google, Twitter, Facebook, StackOverflow, and others, using their APIs. This is a simple web scraping programm that demonstrates how to gather data from an e-commerce platform, and it represents the  that how to scrape data from Amazon.com.

In [3]:
import requests 
from bs4 import BeautifulSoup as bs4

import time
import pandas as pd

header = {

'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,fr;q=0.8',
'cache-control': 'max-age=0',
'downlink': '2.5',
'ect': '4g',
'rtt': '200',
'sec-ch-ua': '"Google Chrome";v="95", "Chromium";v="95", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
}


s = requests.session()
res = s.get('https://www.amazon.com/s?k=gaming+headsets&ref=nb_sb_noss',headers=header)

links = []
data = []
i=0
while True:
    soup= bs4(res.text,"lxml")
    prods=soup.find_all("div",{"data-component-type":"s-search-result"})
    #print(prods)
#title = ti.find_all('span',{"class":"a-size-base-plus a-color-base a-text-normal"})
    
    for prod in prods:        
        try:
            title = prod.find_all('span',{"class":"a-size-medium a-color-base a-text-normal"})
            title = title[0].text
            price = prod.find_all('span',{"class":"a-price-whole"})
            price = price[0].text
            rating = prod.find_all('span',{"class":"a-icon-alt"})
            rating = rating[0].text
            ratingcount = prod.find_all('span',{"class":"a-size-base s-underline-text"})
            ratingcount = ratingcount[0].text
            data.append(title)
            data.append(price)
            data.append(rating)
            data.append(ratingcount)
            print(title)
            print(price)
            print(rating)
            print(ratingcount)
            i = i+1
            
            if i>= 100:
                break
       
        except:
            break
    if i> 100:
        break
        
data = pd.DataFrame(data)
        
import numpy as np
title =[]
price =[]
rating =[]
ratingcount=[]


for x in range(0,len(data),4):
    title.append(data.iloc[x])


for x in range(1,len(data),4):
    price.append(data.iloc[x])
    

for x in range(2,len(data),4):
    rating.append(data.iloc[x])


for x in range(3,len(data),4):
    ratingcount.append(data.iloc[x])  
    
title = pd.DataFrame(title).reset_index().drop('index',axis = 1)
price = pd.DataFrame(price ).reset_index().drop('index',axis = 1)
rating = pd.DataFrame(rating ).reset_index().drop('index',axis = 1)
ratingcount = pd.DataFrame(ratingcount).reset_index().drop('index',axis = 1)

data2 =pd.concat([title,price,rating,ratingcount], axis = 1)
data2.columns =['title', 'price', 'rating', 'rating_count']
data2.head(3) # print first 3 rows
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
BENGOO G9000 Stereo Gaming Headset for PS4 PC Xbox One PS5 Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Nintendo NES Games
25.
4.3 out of 5 stars
98,735
MOVOYEE Gaming Earbuds with Microphone,Stereo Headphones Deep Bass Earphones Sport Earbuds Wired Detachable Noise Cancelling Gaming Headset for Mobile Phone Games PC PS4 Xbox One Playstation 5-Black
9.
4.0 out of 5 stars
1
Out[3]:
title price rating rating_count
0 MOVOYEE Gaming Earbuds with Microphone,Stereo ... 9. 4.0 out of 5 stars 1
1 BENGOO G9000 Stereo Gaming Headset for PS4 PC ... 25. 4.3 out of 5 stars 98,735
2 MOVOYEE Gaming Earbuds with Microphone,Stereo ... 9. 4.0 out of 5 stars 1

There are benefits and drawbacks to using web scraping as a data collection technique for my analysis. Web scraping has problems with storage capacity and using a lot of machine power and data are unstructured. Web scraping has the advantage of allowing for the extraction of real-time data, but my data pipeline is incompatible with real-time data.

2.5 limitation of Data constraints

Different people have a different reviews of the product

Because some of the product some online buyers might think is cheap some might think is not. We can’t base on one review to determine the result. By buying things online we need to compare a few reviews to make a decision so that we can trust the seller base on the reviews.

Some merchants buy reviews to gain customers.

For some of the new online sellers, we won’t know if the review it is real because there might be a chance of buying reviews and it is hard to check because it is a new seller.

Old dataset

This report data is from 2017 -2018 so the information is not the latest. Although this is an old dataset the objective of this report is to show the concept of how to find out whether customer satisfaction is more on the price, freight value, or the time difference of arrival.

2.6 Ethical consideration

The owner of these datasets has publicly published them. Downloads can be made from kaggle.com by anyone interested in analyzing this data. To make this data fit any machine learning model, I preprocessed it. Everyone is therefore welcome to analyze this data for future works. I made no distinctions or did anything that would be damaging or harmful to any parties when conducting this analysis. The assumptions I'll make for this analysis are just for my own goals. Intentionally, I will not discriminate against any individuals or groups. This pipeline for processing data is rather simple. I imported the data as pandas dataframes into Jupiter Notebook after downloading it as csv files. Pandas in Jupiter Notebook do the load and transformation steps.

3. Project Background

Since the first internet-based system for eCommerce, Compumarket, was presented by SequoiaData Crop in May 1989, it has a long history. eCommerce is evolving into a key component of business operations and a powerful engine for economic growth in the new growing global economy. Through increased competition, cost reductions, and modifications in sellers' pricing structures, the constant growth of eCommerce may lead in downward pressure on inflation. E-commerce is popular around the world for a number of reasons, including how it may help you cut costs, expand your business internationally, operate with fewer risks and overhead, provide better marketing opportunities, and ensure the highest level of transaction security.

There are several issues with the e-commerce sector that have not yet been addressed by research. In this article, I'll highlight just a few.

Since logistic service providers take orders from e-commerce clients with forwarding needs rather than e-tailers, they are unable to obtain the demand data that e-tailers shared. Only data from the platforms of the logistic service providers themselves is available for demand forecasting. As a result, it is challenging to estimate logistics demand precisely and effectively. As a result, it can be particularly difficult for logistic service providers to allocate the capacity of logistic facilities (such as the pickup store, various types of lockers, and vehicles) in different logistical regions. On the one hand, each logistics region's customer demand should be able to be satisfied by the capacities of the facilities that handle logistics. On the other side, fewer facilities should be used in order to lower the anticipated cost of facility upkeep and operation. Therefore, as a part of operational planning, the optimal logistics service capacity allocation must choose and assign enough resources to satisfy the logistics demand requirements in the upcoming planning cycle. The logistics company must therefore determine the most inexpensive and efficient way to allocate the capacity for flexible logistics services throughout the distribution network, Shuyun Ren et al(2020).

Although e-commerce lowers costs, increases revenues, and enhances the consumer experience, it also exposes merchants to major challenges from sophisticated fraudsters who exploit and defraud the online channel by taking advantage of its relative anonymity and simplicity. The fraud and abuse that retailers experience comes in many different forms. These put the reputation and profitability of the merchants at risk, and include account takeover, first-party fraud, free-trial abuse, fake product reviews, warranty fraud, refund fraud, reseller fraud, and misuse of program discounts, among many more. Some forms of fraud, such as money laundering and the spread of false information, have serious negative effects on society as a whole. E-commerce fraud is expanding quickly because, in contrast to the early years of the Internet, fraudsters today are well-funded and well-equipped professional rings. The current estimate for fraud losses in 2018 is 1.8% of online income. After taking into account the significant overhead associated with servicing fraud occurrences, e-commerce firms lose $250 billion yearly due to fraud charges, Jay Nanduri et al (2020).

This task's scope is limited to structured data analysis. Since the review score variable is an ordinal categorical variable, I am only analyzing it in that sense. However, the study of Olist data goes beyond simply analyzing review scores. Applying sentiment analysis to this data will allow any analyst to progress. Customer reviews are a feature of Olist data sets. This feature contains text data and is unstructured. It can be modeled using Natural Language Processing

I utilized a straightforward pandas data pipeline. The data set goes through a number of processes in the data pipeline. The steps of my data pipeline are as follows:

Importing the dataset.

Merge datasets in to one master dataset

Handling missing data.

Encoding categorical data.

Splitting the dataset (Train Set and Test set).

Feature Scaling

4. Data Processing

Data preprocessing is a phase in the data mining and data analysis process that converts raw data into a format that computers and machine learning algorithms can understand and evaluate.Text, images, video, and many other forms of raw, real-world data are disordered. In addition to the likelihood of inaccuracies and inconsistencies, it is often missing and lacks a regular, uniform format.Machines need to process information that is tidy; they read data as 1s and 0s. Thus, it is simple to calculate data like whole percentages and numbers. However, unstructured data needs to be cleaned and formatted  before analysis.

4.1 Changing of Data Types

I changed the data types of some variables so that we are able to work with the correct data types. Variables must be assigned the proper data type. This offers several advantages, including memory optimization, variable placement in the vector space, improvements to the machine learning model, and more.

To group them together, for instance, we needed the review score to be of category, and the variables with datetime needed to be of datetime64 so that we could determine the time difference.

In [4]:
merge['order_status'] = merge['order_status'].astype('category')
merge['review_score'] = merge['review_score'].astype('category')

merge['order_purchase_timestamp'] = merge['order_purchase_timestamp'].astype('datetime64')
merge['order_approved_at'] = merge['order_approved_at'].astype('datetime64')
merge['order_delivered_carrier_date'] = merge['order_delivered_carrier_date'].astype('datetime64')
merge['order_delivered_customer_date'] = merge['order_delivered_customer_date'].astype('datetime64')
merge['order_estimated_delivery_date'] = merge['order_estimated_delivery_date'].astype('datetime64')
merge['shipping_limit_date'] = merge['shipping_limit_date'].astype('datetime64')

merge.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 113322 entries, 0 to 113321
Data columns (total 13 columns):
 #   Column                         Non-Null Count   Dtype         
---  ------                         --------------   -----         
 0   order_id                       113322 non-null  object        
 1   customer_id                    113322 non-null  object        
 2   price                          113322 non-null  float64       
 3   freight_value                  113322 non-null  float64       
 4   review_score                   113322 non-null  category      
 5   customer_state                 113322 non-null  object        
 6   order_status                   113322 non-null  category      
 7   shipping_limit_date            113322 non-null  datetime64[ns]
 8   order_purchase_timestamp       113322 non-null  datetime64[ns]
 9   order_approved_at              113307 non-null  datetime64[ns]
 10  order_delivered_carrier_date   112119 non-null  datetime64[ns]
 11  order_delivered_customer_date  110847 non-null  datetime64[ns]
 12  order_estimated_delivery_date  113322 non-null  datetime64[ns]
dtypes: category(2), datetime64[ns](6), float64(2), object(3)
memory usage: 10.6+ MB

4.2 Creating new variables¶

The time difference between the predetermined order deliver date and the actual order deliver date is not described by any variables in the data set. So I subtracted the variables "order delivered customer date" and "order estimated delivery date" to generate a new variable. Whether or not the product was delivered on time is another factor that is crucial to consider when analyzing the report's overall goal. 'order estimated delivery date' and 'order estimated delivery date' were utilized to extract the bigger value in this case using the pandas.gt() method. This technique produces boolean output

In [5]:
merge["time_diff"] =  merge["order_delivered_customer_date"] - merge["order_estimated_delivery_date"]
merge["time_diff"] = merge['time_diff'].dt.days
merge["is_late"] = merge['order_estimated_delivery_date'].gt(merge['order_estimated_delivery_date'])
merge["is_late"] = merge["is_late"].astype("category")

4.3 Removing of Duplicates and Cleaning of Data¶

After which we drop rows with duplicates entries and calculated the time difference to determine if the orders are late. We also check if the orders are delivered on the same date as the estimated delivery date, it would not be considered late.I only used a for loop for that. I declare a variable called fixtime, which represents the index values of the "time diff" variable that are not missing values. When the time diff variable is equal to zero, I set the 'Is late' value to False by using the index where we obtained from the 'fixtime' variable.

In [6]:
merge = merge.drop_duplicates(subset=['order_id'])
merge = merge[merge['time_diff'].notna()]
merge = merge.reset_index(drop = True)

fixtime = merge.index[merge['time_diff'].isnull() == False]
for time in fixtime:
    if (merge["time_diff"][time] == 0):
        merge['is_late'][time] = False
        
merge["is_late"] = merge["is_late"].astype("boolean")

merge.head(10)
Out[6]:
order_id customer_id price freight_value review_score customer_state order_status shipping_limit_date order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date time_diff is_late
0 00010242fe8c5a6d1ba2dd792cb16214 3ce436f183e68e07877b285a838db11a 58.90 13.29 5 RJ delivered 2017-09-19 09:45:35 2017-09-13 08:59:02 2017-09-13 09:45:35 2017-09-19 18:34:16 2017-09-20 23:43:48 2017-09-29 -9.0 False
1 00018f77f2f0320c557190d7a144bdd3 f6dd3ec061db4e3987629fe6b26e5cce 239.90 19.93 4 SP delivered 2017-05-03 11:05:13 2017-04-26 10:53:06 2017-04-26 11:05:13 2017-05-04 14:35:00 2017-05-12 16:04:24 2017-05-15 -3.0 False
2 000229ec398224ef6ca0657da4fc703e 6489ae5e4333f3693df5ad4372dab6d3 199.00 17.87 5 MG delivered 2018-01-18 14:48:30 2018-01-14 14:33:31 2018-01-14 14:48:30 2018-01-16 12:36:48 2018-01-22 13:19:16 2018-02-05 -14.0 False
3 00024acbcdf0a6daa1e931b038114c75 d4eb9395c8c0431ee92fce09860c5a06 12.99 12.79 4 SP delivered 2018-08-15 10:10:18 2018-08-08 10:00:35 2018-08-08 10:10:18 2018-08-10 13:28:00 2018-08-14 13:32:39 2018-08-20 -6.0 False
4 00042b26cf59d7ce69dfabb4e55b4fd9 58dbd0b2d70206bf40e62cd34e84d795 199.90 18.14 5 SP delivered 2017-02-13 13:57:51 2017-02-04 13:57:51 2017-02-04 14:10:13 2017-02-16 09:46:09 2017-03-01 16:42:31 2017-03-17 -16.0 False
5 00048cc3ae777c65dbb7d2a0634bc1ea 816cbea969fe5b689b39cfc97a506742 21.90 12.69 4 MG delivered 2017-05-23 03:55:27 2017-05-15 21:42:34 2017-05-17 03:55:27 2017-05-17 11:05:55 2017-05-22 13:44:35 2017-06-06 -15.0 False
6 00054e8431b9d7675808bcb819fb4a32 32e2e6ab09e778d99bf2e0ecd4898718 19.90 11.85 4 SP delivered 2017-12-14 12:10:31 2017-12-10 11:53:48 2017-12-10 12:10:31 2017-12-12 01:07:48 2017-12-18 22:03:38 2018-01-04 -17.0 False
7 000576fe39319847cbb9d288c5617fa6 9ed5e522dd9dd85b4af4a077526d8117 810.00 70.75 5 SP delivered 2018-07-10 12:30:45 2018-07-04 12:08:27 2018-07-05 16:35:48 2018-07-05 12:15:00 2018-07-09 14:04:07 2018-07-25 -16.0 False
8 0005a1a1728c9d785b8e2b08b904576c 16150771dfd4776261284213b89c304e 145.95 11.65 1 SP delivered 2018-03-26 18:31:29 2018-03-19 18:40:33 2018-03-20 18:35:21 2018-03-28 00:37:42 2018-03-29 18:17:31 2018-03-29 0.0 False
9 0005f50442cb953dcd1d21e1fb923495 351d3cb2cee3c7fd0af6616c82df21d3 53.99 11.40 4 SP delivered 2018-07-06 14:10:56 2018-07-02 13:59:39 2018-07-02 14:10:56 2018-07-03 14:25:00 2018-07-04 17:28:31 2018-07-23 -19.0 False
In [7]:
merge.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96476 entries, 0 to 96475
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   order_id                       96476 non-null  object        
 1   customer_id                    96476 non-null  object        
 2   price                          96476 non-null  float64       
 3   freight_value                  96476 non-null  float64       
 4   review_score                   96476 non-null  category      
 5   customer_state                 96476 non-null  object        
 6   order_status                   96476 non-null  category      
 7   shipping_limit_date            96476 non-null  datetime64[ns]
 8   order_purchase_timestamp       96476 non-null  datetime64[ns]
 9   order_approved_at              96462 non-null  datetime64[ns]
 10  order_delivered_carrier_date   96475 non-null  datetime64[ns]
 11  order_delivered_customer_date  96476 non-null  datetime64[ns]
 12  order_estimated_delivery_date  96476 non-null  datetime64[ns]
 13  time_diff                      96476 non-null  float64       
 14  is_late                        96476 non-null  boolean       
dtypes: boolean(1), category(2), datetime64[ns](6), float64(3), object(3)
memory usage: 9.2+ MB

Here we updated the column names so that they are all upper case, for better visualization.

We also spotted some null values, thus we filled them in.

In [8]:
data_clean = merge.copy()

data_clean = data_clean.rename(columns = str.upper)

data_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96476 entries, 0 to 96475
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   ORDER_ID                       96476 non-null  object        
 1   CUSTOMER_ID                    96476 non-null  object        
 2   PRICE                          96476 non-null  float64       
 3   FREIGHT_VALUE                  96476 non-null  float64       
 4   REVIEW_SCORE                   96476 non-null  category      
 5   CUSTOMER_STATE                 96476 non-null  object        
 6   ORDER_STATUS                   96476 non-null  category      
 7   SHIPPING_LIMIT_DATE            96476 non-null  datetime64[ns]
 8   ORDER_PURCHASE_TIMESTAMP       96476 non-null  datetime64[ns]
 9   ORDER_APPROVED_AT              96462 non-null  datetime64[ns]
 10  ORDER_DELIVERED_CARRIER_DATE   96475 non-null  datetime64[ns]
 11  ORDER_DELIVERED_CUSTOMER_DATE  96476 non-null  datetime64[ns]
 12  ORDER_ESTIMATED_DELIVERY_DATE  96476 non-null  datetime64[ns]
 13  TIME_DIFF                      96476 non-null  float64       
 14  IS_LATE                        96476 non-null  boolean       
dtypes: boolean(1), category(2), datetime64[ns](6), float64(3), object(3)
memory usage: 9.2+ MB

4.4 Missing values¶

I use paqndas isnull() method to find the missing values

In [9]:
data_clean.isnull().sum()
Out[9]:
ORDER_ID                          0
CUSTOMER_ID                       0
PRICE                             0
FREIGHT_VALUE                     0
REVIEW_SCORE                      0
CUSTOMER_STATE                    0
ORDER_STATUS                      0
SHIPPING_LIMIT_DATE               0
ORDER_PURCHASE_TIMESTAMP          0
ORDER_APPROVED_AT                14
ORDER_DELIVERED_CARRIER_DATE      1
ORDER_DELIVERED_CUSTOMER_DATE     0
ORDER_ESTIMATED_DELIVERY_DATE     0
TIME_DIFF                         0
IS_LATE                           0
dtype: int64

Dataset has verry less missing values from ORDER_APPROVED_AT variable and ORDER_DELIVERED_CARRIER_DATE variable

In [10]:
len(data_clean.index)
Out[10]:
96476

I impute missing values by using pandas fillna() method. I replace the missing values by 'No_Time' string

In [11]:
data_clean["ORDER_APPROVED_AT"].fillna(value = "No_Time", inplace = True)
data_clean["ORDER_DELIVERED_CARRIER_DATE"].fillna(value = "No_Time", inplace = True)
In [12]:
data_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96476 entries, 0 to 96475
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   ORDER_ID                       96476 non-null  object        
 1   CUSTOMER_ID                    96476 non-null  object        
 2   PRICE                          96476 non-null  float64       
 3   FREIGHT_VALUE                  96476 non-null  float64       
 4   REVIEW_SCORE                   96476 non-null  category      
 5   CUSTOMER_STATE                 96476 non-null  object        
 6   ORDER_STATUS                   96476 non-null  category      
 7   SHIPPING_LIMIT_DATE            96476 non-null  datetime64[ns]
 8   ORDER_PURCHASE_TIMESTAMP       96476 non-null  datetime64[ns]
 9   ORDER_APPROVED_AT              96476 non-null  object        
 10  ORDER_DELIVERED_CARRIER_DATE   96476 non-null  object        
 11  ORDER_DELIVERED_CUSTOMER_DATE  96476 non-null  datetime64[ns]
 12  ORDER_ESTIMATED_DELIVERY_DATE  96476 non-null  datetime64[ns]
 13  TIME_DIFF                      96476 non-null  float64       
 14  IS_LATE                        96476 non-null  boolean       
dtypes: boolean(1), category(2), datetime64[ns](4), float64(3), object(5)
memory usage: 9.2+ MB
In [13]:
data_clean.head(10)
Out[13]:
ORDER_ID CUSTOMER_ID PRICE FREIGHT_VALUE REVIEW_SCORE CUSTOMER_STATE ORDER_STATUS SHIPPING_LIMIT_DATE ORDER_PURCHASE_TIMESTAMP ORDER_APPROVED_AT ORDER_DELIVERED_CARRIER_DATE ORDER_DELIVERED_CUSTOMER_DATE ORDER_ESTIMATED_DELIVERY_DATE TIME_DIFF IS_LATE
0 00010242fe8c5a6d1ba2dd792cb16214 3ce436f183e68e07877b285a838db11a 58.90 13.29 5 RJ delivered 2017-09-19 09:45:35 2017-09-13 08:59:02 2017-09-13 09:45:35 2017-09-19 18:34:16 2017-09-20 23:43:48 2017-09-29 -9.0 False
1 00018f77f2f0320c557190d7a144bdd3 f6dd3ec061db4e3987629fe6b26e5cce 239.90 19.93 4 SP delivered 2017-05-03 11:05:13 2017-04-26 10:53:06 2017-04-26 11:05:13 2017-05-04 14:35:00 2017-05-12 16:04:24 2017-05-15 -3.0 False
2 000229ec398224ef6ca0657da4fc703e 6489ae5e4333f3693df5ad4372dab6d3 199.00 17.87 5 MG delivered 2018-01-18 14:48:30 2018-01-14 14:33:31 2018-01-14 14:48:30 2018-01-16 12:36:48 2018-01-22 13:19:16 2018-02-05 -14.0 False
3 00024acbcdf0a6daa1e931b038114c75 d4eb9395c8c0431ee92fce09860c5a06 12.99 12.79 4 SP delivered 2018-08-15 10:10:18 2018-08-08 10:00:35 2018-08-08 10:10:18 2018-08-10 13:28:00 2018-08-14 13:32:39 2018-08-20 -6.0 False
4 00042b26cf59d7ce69dfabb4e55b4fd9 58dbd0b2d70206bf40e62cd34e84d795 199.90 18.14 5 SP delivered 2017-02-13 13:57:51 2017-02-04 13:57:51 2017-02-04 14:10:13 2017-02-16 09:46:09 2017-03-01 16:42:31 2017-03-17 -16.0 False
5 00048cc3ae777c65dbb7d2a0634bc1ea 816cbea969fe5b689b39cfc97a506742 21.90 12.69 4 MG delivered 2017-05-23 03:55:27 2017-05-15 21:42:34 2017-05-17 03:55:27 2017-05-17 11:05:55 2017-05-22 13:44:35 2017-06-06 -15.0 False
6 00054e8431b9d7675808bcb819fb4a32 32e2e6ab09e778d99bf2e0ecd4898718 19.90 11.85 4 SP delivered 2017-12-14 12:10:31 2017-12-10 11:53:48 2017-12-10 12:10:31 2017-12-12 01:07:48 2017-12-18 22:03:38 2018-01-04 -17.0 False
7 000576fe39319847cbb9d288c5617fa6 9ed5e522dd9dd85b4af4a077526d8117 810.00 70.75 5 SP delivered 2018-07-10 12:30:45 2018-07-04 12:08:27 2018-07-05 16:35:48 2018-07-05 12:15:00 2018-07-09 14:04:07 2018-07-25 -16.0 False
8 0005a1a1728c9d785b8e2b08b904576c 16150771dfd4776261284213b89c304e 145.95 11.65 1 SP delivered 2018-03-26 18:31:29 2018-03-19 18:40:33 2018-03-20 18:35:21 2018-03-28 00:37:42 2018-03-29 18:17:31 2018-03-29 0.0 False
9 0005f50442cb953dcd1d21e1fb923495 351d3cb2cee3c7fd0af6616c82df21d3 53.99 11.40 4 SP delivered 2018-07-06 14:10:56 2018-07-02 13:59:39 2018-07-02 14:10:56 2018-07-03 14:25:00 2018-07-04 17:28:31 2018-07-23 -19.0 False

Now the dataset is cleaned and ready for the exploratory data analysis

In [14]:
data_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96476 entries, 0 to 96475
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   ORDER_ID                       96476 non-null  object        
 1   CUSTOMER_ID                    96476 non-null  object        
 2   PRICE                          96476 non-null  float64       
 3   FREIGHT_VALUE                  96476 non-null  float64       
 4   REVIEW_SCORE                   96476 non-null  category      
 5   CUSTOMER_STATE                 96476 non-null  object        
 6   ORDER_STATUS                   96476 non-null  category      
 7   SHIPPING_LIMIT_DATE            96476 non-null  datetime64[ns]
 8   ORDER_PURCHASE_TIMESTAMP       96476 non-null  datetime64[ns]
 9   ORDER_APPROVED_AT              96476 non-null  object        
 10  ORDER_DELIVERED_CARRIER_DATE   96476 non-null  object        
 11  ORDER_DELIVERED_CUSTOMER_DATE  96476 non-null  datetime64[ns]
 12  ORDER_ESTIMATED_DELIVERY_DATE  96476 non-null  datetime64[ns]
 13  TIME_DIFF                      96476 non-null  float64       
 14  IS_LATE                        96476 non-null  boolean       
dtypes: boolean(1), category(2), datetime64[ns](4), float64(3), object(5)
memory usage: 9.2+ MB

5 Exploratory Data Analysis

4.1 Predicting of Review Score

We start by looking at the spread of the variable and against the other variables we are planning to use as a predictor.

In [15]:
data_clean['REVIEW_SCORE'].describe()
Out[15]:
count     96476
unique        5
top           5
freq      56841
Name: REVIEW_SCORE, dtype: int64

4.2 Distribution of REVIEW_SCORE¶

In [16]:
sb.catplot(y = 'REVIEW_SCORE', data = data_clean, kind = "count")
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x292d0408dc0>

Most customers have given reviews with the highest rating.We can clearly see that our target variable is imbalance.To balance the target variable, we must use some resampling approaches.

4.3 PRICE vs REVIEW_SCORE¶

In [17]:
f = plt.figure(figsize=(16, 8))
sb.boxplot(x = 'PRICE', y = 'REVIEW_SCORE', data = data_clean)
plt.show()

As the majority of customer reviews from each and every review score are distributed around less than 500 price value, it appears that PRICE has little effect on the REVIEW SCORE. Customers ordering the product are not surprised by the price. When they order, they are satisfied with the pricing since they have a better understanding of it. Otherwise, customers steer clear of it and don't buy it. That is why, in my opinion, the REVIEW SCORE is distributed at random and price has no impact on customer satisfaction.

4.4 FREIGHT_VALUE vs REVIEW_SCORE¶

In [18]:
f = plt.figure(figsize=(16, 8))
sb.boxplot(x = 'FREIGHT_VALUE', y = 'REVIEW_SCORE', data = data_clean)
plt.show()

FREIGHT_VALUE is behaving similar to PRICE in this vector space. It makes little impact on the target variable

4.5 TIME_DIFF vs REVIEW_SCORE¶

In [19]:
f = plt.figure(figsize=(16, 8))
sb.boxplot(x = 'TIME_DIFF', y = 'REVIEW_SCORE', data = data_clean)
plt.show()

This distribution is not random. We can clearly see that when the TIME_DIFF increasing REVIEW_SCORE decreasing. Scale of the TIME_DIFF is vary from PRICE. So Little change in TIME_DIFF will make a huge impact on our target variable. But the PRICE is distributed between very large range while TIME_DIFF is distributing in small range.

4.6 Correlation analysis¶

In [20]:
data_clean['REVIEW_SCORE'] = data_clean['REVIEW_SCORE'].astype('int64')
sb.heatmap(data_clean.corr(), vmin = -1, vmax = 1, annot = True, fmt=".2f")
data_clean['REVIEW_SCORE'] = data_clean['REVIEW_SCORE'].astype('category')
plt.show()

REVIEW_SCORE does not have much relation with PRICE and FREIGHT_VALUE.

It is clear that REVIEW_SCORE have a negative relation with both TIME_DIFF and IS_LATE. Hence, we know can see that as the customer will give a lower score when he receives the item past the estimated delivery time

4.7 Map

4.7.1 map visualisation for late orders and average review score by region¶

We need to import several packages to map visualization. I am importing those packages in below cell. For visualize map I used Olist olist_geolocation_dataset.csv data file. It corresponds to the first three digits of each zip code with 323,000 lat/lng coordinates.

In [21]:
# plot wtih datashader - image with black background
import holoviews as hv
import geoviews as gv
import datashader as ds
from colorcet import fire, rainbow, bgy, bjy, bkr, kb, kr
from datashader.colors import colormap_select, Greys9
from holoviews.streams import RangeXY
from holoviews.operation.datashader import datashade, dynspread, rasterize
from bokeh.io import push_notebook, show, output_notebook
from datashader.utils import lnglat_to_meters as webm
# installing dependicies -> datashader, holoviews
import sys
# !conda install --yes --prefix {sys.prefix} datashader
# !conda install --yes --prefix {sys.prefix} colorcet
# !conda install --yes --prefix {sys.prefix} functools

Extracting and refactoring the columns

CEP: the Brazilian Zip Code¶

A Brazilian zip code, usually referred to as a CEP, is an 8-digit number that stands for the postal addressing code (Código de Endereçamento Postal). It was initially introduced in 1972 as a series of five digits, but in 1992 it was increased to eight digits to enable more accurate localization. The format is typically "nnnnn-nnn" (the original five digits, an hyphen, and the new three digits).

Every public space and some high-occupancy private spaces, such as significant commercial buildings and sizable residential condos, have a CEP allocated to them in the majority of cities with populations of 100,000 or more. A generic 5-digit code followed by the suffix -000 is given to small towns.

Region, Subregion, Sector, Subsector, and Subsector Splitter are represented by the first part's first five numbers. The Distribution Identifiers are represented by the second portion, which has three digits and is separated from the first by a hyphen.

Let's examine the geolocation dataset provided by Olist and attempt to comprehend the geographical operation of CEP.

In [22]:
output_notebook()
hv.extension('bokeh')

geo = pd.read_csv("archive\olist_geolocation_dataset.csv", dtype={'geolocation_zip_code_prefix': str})

# extract first 4 digits of the zip codes
geo['geolocation_zip_code_prefix_1_digits'] = geo['geolocation_zip_code_prefix'].str[0:1]
geo['geolocation_zip_code_prefix_2_digits'] = geo['geolocation_zip_code_prefix'].str[0:2]
geo['geolocation_zip_code_prefix_3_digits'] = geo['geolocation_zip_code_prefix'].str[0:3]
geo['geolocation_zip_code_prefix_4_digits'] = geo['geolocation_zip_code_prefix'].str[0:4]

geo.head(3)

x, y = webm(geo.geolocation_lng, geo.geolocation_lat)
geo['x'] = pd.Series(x)
geo['y'] = pd.Series(y)

# transforming the prefixes to int for plotting purposes
geo['geolocation_zip_code_prefix'] = geo['geolocation_zip_code_prefix'].astype(int)
geo['geolocation_zip_code_prefix_1_digits'] = geo['geolocation_zip_code_prefix_1_digits'].astype(int)
geo['geolocation_zip_code_prefix_2_digits'] = geo['geolocation_zip_code_prefix_2_digits'].astype(int)
geo['geolocation_zip_code_prefix_3_digits'] = geo['geolocation_zip_code_prefix_3_digits'].astype(int)
geo['geolocation_zip_code_prefix_4_digits'] = geo['geolocation_zip_code_prefix_4_digits'].astype(int)



geo.head(3)
Loading BokehJS ...
Out[22]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state geolocation_zip_code_prefix_1_digits geolocation_zip_code_prefix_2_digits geolocation_zip_code_prefix_3_digits geolocation_zip_code_prefix_4_digits x y
0 1037 -23.545621 -46.639292 sao paulo SP 0 1 10 103 -5.191862e+06 -2.698137e+06
1 1046 -23.546081 -46.644820 sao paulo SP 0 1 10 104 -5.192478e+06 -2.698193e+06
2 1046 -23.546129 -46.642951 sao paulo SP 0 1 10 104 -5.192270e+06 -2.698199e+06

19.051 distinct zip code prefixes exist. Each prefix has an average of 52.6 coordinates. One prefix with 1.146 coordinates is however available.

In [23]:
geo['geolocation_zip_code_prefix'].value_counts().to_frame().describe()
Out[23]:
geolocation_zip_code_prefix
count 19015.000000
mean 52.598633
std 72.057907
min 1.000000
25% 10.000000
50% 29.000000
75% 66.500000
max 1146.000000

The dataset contains some outlier coordinates that are outside of Brazil. Let's ensure that every coordinate falls within a rectangle defined by Brazil's borders.

In [24]:
# Removing some outliers
#Brazils most Northern spot is at 5 deg 16′ 27.8″ N latitude.;
geo = geo[geo.geolocation_lat <= 5.27438888]
#it’s most Western spot is at 73 deg, 58′ 58.19″W Long.
geo = geo[geo.geolocation_lng >= -73.98283055]
#It’s most southern spot is at 33 deg, 45′ 04.21″ S Latitude.
geo = geo[geo.geolocation_lat >= -33.75116944]
#It’s most Eastern spot is 34 deg, 47′ 35.33″ W Long.
geo = geo[geo.geolocation_lng <=  -34.79314722]

Now we are good to go further. Let's build the maps

In [25]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
import holoviews as hv

output_notebook()
hv.extension('bokeh')

%opts Overlay [width=900 height=600 toolbar='above' xaxis=None yaxis=None]
%opts QuadMesh [tools=['hover'] colorbar=True] (alpha=0 hover_alpha=0.2)

T = 0.05
PX = 1

def plot_map(data, label, agg_data, agg_name, cmap):
    url="http://server.arcgisonline.com/ArcGIS/rest/services/Canvas/World_Dark_Gray_Base/MapServer/tile/{Z}/{Y}/{X}.png"
    geomap = gv.WMTS(url)
    points = hv.Points(gv.Dataset(data, kdims=['x', 'y'], vdims=[agg_name]))
    agg = datashade(points, element_type=gv.Image, aggregator=agg_data, cmap=cmap)
    zip_codes = dynspread(agg, threshold=T, max_px=PX)
    hover = hv.util.Dynamic(rasterize(points, aggregator=agg_data, width=100, height=50, streams=[RangeXY]), operation=hv.QuadMesh)
    hover = hover.options(cmap=cmap)
    img = geomap * zip_codes * hover
    img = img.relabel(label)
    return img
Loading BokehJS ...

Define map ploting function

In [26]:
orders_df = pd.read_csv('archive/olist_orders_dataset.csv')
order_items = pd.read_csv('archive/olist_order_items_dataset.csv')
order_reviews = pd.read_csv('archive/olist_order_reviews_dataset.csv')
customer = pd.read_csv('archive/olist_customers_dataset.csv', dtype={'customer_zip_code_prefix': str})

# getting the first 3 digits of customer zipcode
customer['customer_zip_code_prefix_3_digits'] = customer['customer_zip_code_prefix'].str[0:3]
customer['customer_zip_code_prefix_3_digits'] = customer['customer_zip_code_prefix_3_digits'].astype(int)

brazil_geo = geo.set_index('geolocation_zip_code_prefix_3_digits').copy()

orders = orders_df.merge(order_items, on='order_id')
orders = orders.merge(customer, on='customer_id')
orders = orders.merge(order_reviews, on='order_id')

orders.head(3)  
gp = orders.groupby('customer_zip_code_prefix_3_digits')['review_score'].mean().to_frame()
score = brazil_geo.join(gp)
agg_name = 'avg_score'
score[agg_name] = score['review_score']

plot_map(score, 'Orders Average Review Score', ds.mean(agg_name), agg_name, cmap=bgy)
Out[26]:

Customers are more likely to provide poor ratings on purchases in the Northeast Region and the State of Rio de Janeiro.

In [27]:
orders['is_delayed'] = orders['order_delivered_customer_date'] > orders['order_estimated_delivery_date'] 
gp = orders.groupby('customer_zip_code_prefix_3_digits').agg({'is_delayed': ['sum', 'count']})
agg_name = 'delayed'
gp[agg_name] = gp['is_delayed']['sum'] / gp['is_delayed']['count']
gp = gp[agg_name]
order_delay = brazil_geo.join(gp)

plot_map(order_delay, 'Orders Delay Percentage in Brazil', ds.mean(agg_name), agg_name, cmap=bgy)
Out[27]:

To see where order delivery are most likely to be delayed, consider Rio de Janeiro once more.

5. RANDOM FOREST CLASSIFIER¶

I have thus far looked into the relationship between review score and price, fright value, delivery time differential, and whether or not delivery is delayed. I'm going to get the predictions by feeding the data into a random forest classifier right now.

5.1 Prepare the dataset to feed random forest classifier¶

Since we are focused about determining how the price, fright value, time difference, and whether or not the delivery was delayed would affect the review score,  I will thus extract those variables in that regard from the prepared dataset and reindex according to feature variable datase.

In [28]:
clean_num = data_clean[['PRICE','FREIGHT_VALUE','TIME_DIFF','IS_LATE']]
clean_res = data_clean['REVIEW_SCORE']
clean_ohe = pd.concat([clean_num, clean_res], 
                           sort = False, axis = 1).reindex(index=clean_num.index)

clean_ohe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96476 entries, 0 to 96475
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   PRICE          96476 non-null  float64 
 1   FREIGHT_VALUE  96476 non-null  float64 
 2   TIME_DIFF      96476 non-null  float64 
 3   IS_LATE        96476 non-null  boolean 
 4   REVIEW_SCORE   96476 non-null  category
dtypes: boolean(1), category(1), float64(3)
memory usage: 2.5 MB

5.2 One hot encoding¶

The random forest classifier cannot be fed categorical variables. We must perform one hot encoding for it. For one hot encoding, I utilized the pandas get dummy method.

In [29]:
clean_ohe = pd.get_dummies(clean_ohe,columns =['IS_LATE'],drop_first = True )
clean_ohe.head()
Out[29]:
PRICE FREIGHT_VALUE TIME_DIFF REVIEW_SCORE
0 58.90 13.29 -9.0 5
1 239.90 19.93 -3.0 4
2 199.00 17.87 -14.0 5
3 12.99 12.79 -6.0 4
4 199.90 18.14 -16.0 5

5.3 Here we balance out all the classes of the response variable `REVIEW_SCORE` in the data.

We must make sure that our target variable is balanced before feeding the data into the random forest classifier. Balanced in the sense that the class labels for the target variable should be fairly distributed among the data instances. The majority class label can be reduced using undersampling, the minority class label can be increased using oversampling, and SMOTE can be used. I simply distributed the class labels throughout the data samples in this analysis.

In [30]:
# Upsample Bad to match Good
from sklearn.utils import resample

review_1 = clean_ohe[clean_ohe.REVIEW_SCORE == 1]
review_2 = clean_ohe[clean_ohe.REVIEW_SCORE == 2]
review_3 = clean_ohe[clean_ohe.REVIEW_SCORE == 3]
review_4 = clean_ohe[clean_ohe.REVIEW_SCORE == 4]
review_5 = clean_ohe[clean_ohe.REVIEW_SCORE == 5]
 
# Upsample the Bad samples
review_1_up = resample(review_1, 
                     replace=True,                     # sample with replacement
                     n_samples=review_5.shape[0])    # to match number of Good

review_2_up = resample(review_2, 
                     replace=True,                     # sample with replacement
                     n_samples=review_5.shape[0])    # to match number of Good

review_3_up = resample(review_3, 
                     replace=True,                     # sample with replacement
                     n_samples=review_5.shape[0])    # to match number of Good

review_4_up = resample(review_4, 
                     replace=True,                     # sample with replacement
                     n_samples=review_5.shape[0])    # to match number of Good
 
# Combine the two classes back after upsampling
review_ohe_up = pd.concat([review_5, review_1_up, review_2_up, review_3_up, review_4_up])
 
# Check the ratio of the classes
review_ohe_up['REVIEW_SCORE'].value_counts()
Out[30]:
1    56841
2    56841
3    56841
4    56841
5    56841
Name: REVIEW_SCORE, dtype: int64

Now the class labels are evenly distributed

Quick plot to check the balanced classes visually¶

In [31]:
sb.catplot(y = 'REVIEW_SCORE', data = review_ohe_up, kind = "count")
plt.show()
In [32]:
review_ohe_up.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 284205 entries, 0 to 2113
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype   
---  ------         --------------   -----   
 0   PRICE          284205 non-null  float64 
 1   FREIGHT_VALUE  284205 non-null  float64 
 2   TIME_DIFF      284205 non-null  float64 
 3   REVIEW_SCORE   284205 non-null  category
dtypes: category(1), float64(3)
memory usage: 8.9 MB

5.4 Build the random forest calssification model¶

In this case, I divided the dataset into 25% test data and 75% training data. I set the random forest classifier's tree count to 100 and the number of levels in each tree to 10.

In [33]:
# Import essential models and functions from sklearn
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

y = pd.DataFrame(review_ohe_up['REVIEW_SCORE'])
X = pd.DataFrame(review_ohe_up.drop('REVIEW_SCORE', axis = 1))

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Import RandomForestClassifier model from Scikit-Learn
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest object
rforest = RandomForestClassifier(n_estimators = 100,  # CHANGE AND EXPERIMENT
                                 max_depth = 10)       # CHANGE AND EXPERIMENT

# Fit Random Forest on Train Data
rforest.fit(X_train, y_train.REVIEW_SCORE.ravel())
Out[33]:
RandomForestClassifier(max_depth=10)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_depth=10)
In [34]:
y_train_pred = rforest.predict(X_train)
y_test_pred = rforest.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", rforest.score(X_train, y_train))
print()
# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", rforest.score(X_test, y_test))
print()

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
plt.show()
Goodness of Fit of Model 	Train Dataset
Classification Accuracy 	: 0.3714280352610566

Goodness of Fit of Model 	Test Dataset
Classification Accuracy 	: 0.35618983279851374

According to the Random Forest Classifier, the classification accuracy for the train set and test set is only about 35%, which is not a very high figure. Since the model's test and training accuracy are nearly identical, we cannot tell if it is overfitting. The fact that the data I used did not perform better for the random forest classifier is the cause of the lower accuracy rate. The original dataset, which contains all the variables, must be used. We can go for a decision tree classifier instead of random forest classifier Prior to that, it is preferable to run a univariate decision tree model and examine the effect that price, fright value, time difference, and delay variables have on the target variables.

6. EVALUATION OF UNI-VARIATE MODELS WITH DECISION TREE CLASSIFIER

Since I utilized resampled data in the last random forest classification model, but the model did not perform well on that set of data, I am going to use the data without using any resampling techniques. Kappa score is a useful evaluation tool for class data that is unbalanced. So instead of using accuracy, precision, and recall values, I utilize kappa score. Kappa is always either zero or less than one. Values of 0 or less signify the classifier is ineffective. Its values cannot be understood in a consistent manner. Values can be described in terms provided by Landis and Koch (1977). Their system classifies values between 0 and 0.20 as slight, 0.21 and 0.40 as fair, 0.41 and 0.60 as moderate, and 0.61 and 0.80 as substantial and 0.81-1 as perfect

6.1 Price

In [35]:
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, cohen_kappa_score
from sklearn import metrics

y = pd.DataFrame(data_clean['REVIEW_SCORE'])   # Response
X = pd.DataFrame(data_clean['PRICE'])       # Predictor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 3, random_state =142)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()
print(metrics.classification_report(y_train,y_train_pred, digits = 5,zero_division = 0))
print()
print("Kappa Score for Train Set:\t")
print(cohen_kappa_score(y_train,y_train_pred))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()
print("Accuracy  :\t", dectree.score(X_test, y_test))
print()
print(metrics.classification_report(y_test,y_test_pred, digits = 5,zero_division = 0))
print()
print("Kappa Score for Test Set:\t")
print(cohen_kappa_score(y_test,y_test_pred))

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
plt.show()
Goodness of Fit of Model 	Train Dataset
Classification Accuracy 	: 0.5899912931713587

              precision    recall  f1-score   support

           1    0.66667   0.00028   0.00055      7224
           2    0.00000   0.00000   0.00000      2271
           3    0.00000   0.00000   0.00000      5958
           4    0.00000   0.00000   0.00000     14215
           5    0.58999   0.99998   0.74212     42689

    accuracy                        0.58999     72357
   macro avg    0.25133   0.20005   0.14854     72357
weighted avg    0.41464   0.58999   0.43789     72357


Kappa Score for Train Set:	
8.326463123198291e-05

Goodness of Fit of Model 	Test Dataset
Classification Accuracy 	: 0.5867573282474398

Accuracy  :	 0.5867573282474398

              precision    recall  f1-score   support

           1    0.00000   0.00000   0.00000      2471
           2    0.00000   0.00000   0.00000       724
           3    0.00000   0.00000   0.00000      2045
           4    0.00000   0.00000   0.00000      4727
           5    0.58676   1.00000   0.73957     14152

    accuracy                        0.58676     24119
   macro avg    0.11735   0.20000   0.14791     24119
weighted avg    0.34428   0.58676   0.43395     24119


Kappa Score for Test Set:	
0.0

The Kappa score value is less than zero. Thus, the kappa value indicates that the price predictor is worthless for predicting review score. Trianing accuracy and the test accuracy are pretty much same. According to the confusion matrix model is predicting all the review scores as 4. Only verry few are predicted as 3. This is why accuracy score is not a good evaluation matrix. So this model is useless. Problem of that model is that the model is underfitting the data

In [36]:
# Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(12,12))
plot_tree(dectree, filled=True, rounded=True, 
          feature_names=["PRICE"], 
          class_names=["1","2","3","4","5"])
plt.show()

6.2 FREIGHT VALUE

In [37]:
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, cohen_kappa_score
from sklearn import metrics

y = pd.DataFrame(data_clean['REVIEW_SCORE'])   # Response
X = pd.DataFrame(data_clean['FREIGHT_VALUE'])       # Predictor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 3,  random_state =142)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()
print(metrics.classification_report(y_train,y_train_pred, digits = 5,zero_division = 0))
print()
print("Kappa Score for Train Set:\t")
print(cohen_kappa_score(y_train,y_train_pred))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()
print("Accuracy  :\t", dectree.score(X_test, y_test))
print()
print(metrics.classification_report(y_test,y_test_pred, digits = 5,zero_division = 0))
print()
print("Kappa Score for Test Set:\t")
print(cohen_kappa_score(y_test,y_test_pred))

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
plt.show()
Goodness of Fit of Model 	Train Dataset
Classification Accuracy 	: 0.5895214008319859

              precision    recall  f1-score   support

           1    0.00000   0.00000   0.00000      7258
           2    0.00000   0.00000   0.00000      2269
           3    0.00000   0.00000   0.00000      6007
           4    0.00000   0.00000   0.00000     14167
           5    0.58952   1.00000   0.74176     42656

    accuracy                        0.58952     72357
   macro avg    0.11790   0.20000   0.14835     72357
weighted avg    0.34754   0.58952   0.43728     72357


Kappa Score for Train Set:	
0.0

Goodness of Fit of Model 	Test Dataset
Classification Accuracy 	: 0.5881255441767901

Accuracy  :	 0.5881255441767901

              precision    recall  f1-score   support

           1    0.00000   0.00000   0.00000      2437
           2    0.00000   0.00000   0.00000       726
           3    0.00000   0.00000   0.00000      1996
           4    0.00000   0.00000   0.00000      4775
           5    0.58813   1.00000   0.74065     14185

    accuracy                        0.58813     24119
   macro avg    0.11763   0.20000   0.14813     24119
weighted avg    0.34589   0.58813   0.43560     24119


Kappa Score for Test Set:	
0.0

The kappa value is zero. I may therefore conclude that the fright value predictor has no impact on the customer's review score, i.e., that the shipping cost has no impact on the customer's satisfaction.Accuracy while testing and during training are nearly identical. The model therefore does not overfit. But the problem is that model must be underfitted. Model predicts all the review score as 4 as same as the price predictor model.

In [38]:
# Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(12,12))
plot_tree(dectree, filled=True, rounded=True, 
          feature_names=["FREIGHT_VLAUE"], 
          class_names=["1","2","3","4","5"])
plt.show()

6.3 TIME DIFFERENCE

In [39]:
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, cohen_kappa_score
from sklearn import metrics


y = pd.DataFrame(data_clean['REVIEW_SCORE'])   # Response
X = pd.DataFrame(data_clean['TIME_DIFF'])       # Predictor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 3,  random_state =142)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()
print(metrics.classification_report(y_train,y_train_pred, digits = 5,zero_division = 0))
print()
print("Kappa Score for Train Set:\t")
print(cohen_kappa_score(y_train,y_train_pred))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()
print("Accuracy  :\t", dectree.score(X_test, y_test))
print()
print(metrics.classification_report(y_test,y_test_pred, digits = 5,zero_division = 0))
print()
print("Kappa Score for Test Set:\t")
print(cohen_kappa_score(y_test,y_test_pred))

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
plt.show()
Goodness of Fit of Model 	Train Dataset
Classification Accuracy 	: 0.6169133601448374

              precision    recall  f1-score   support

           1    0.63702   0.33851   0.44209      7341
           2    0.00000   0.00000   0.00000      2237
           3    0.00000   0.00000   0.00000      6016
           4    0.00000   0.00000   0.00000     14189
           5    0.61577   0.99011   0.75931     42574

    accuracy                        0.61691     72357
   macro avg    0.25056   0.26572   0.24028     72357
weighted avg    0.42694   0.61691   0.49162     72357


Kappa Score for Train Set:	
0.12510129893799182

Goodness of Fit of Model 	Test Dataset
Classification Accuracy 	: 0.6181433724449604

Accuracy  :	 0.6181433724449604

              precision    recall  f1-score   support

           1    0.62658   0.33645   0.43781      2354
           2    0.00000   0.00000   0.00000       758
           3    0.00000   0.00000   0.00000      1987
           4    0.00000   0.00000   0.00000      4753
           5    0.61768   0.98949   0.76057     14267

    accuracy                        0.61814     24119
   macro avg    0.24885   0.26519   0.23968     24119
weighted avg    0.42653   0.61814   0.49263     24119


Kappa Score for Test Set:	
0.12087470706837566

We have a acceptable value for kappa. I can therefore state that a time difference affects the rating of a customer review. However, the model can only accurately predict review scores of 0 and 4.

In [40]:
# Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(12,12))
plot_tree(dectree, filled=True, rounded=True, 
          feature_names=["TIME_DIFF"], 
          class_names=["1","2","3","4","5"])
plt.show()

6.4 IS LATE

In [41]:
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, cohen_kappa_score
from sklearn import metrics

y = pd.DataFrame(data_clean['REVIEW_SCORE'])   # Response
X = pd.DataFrame(data_clean['IS_LATE'])       # Predictor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 3,  random_state =142)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()
print(metrics.classification_report(y_train,y_train_pred, digits = 5,zero_division = 0))
print()
print("Kappa Score for Train Set:\t")
print(cohen_kappa_score(y_train,y_train_pred))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()
print("Accuracy  :\t", dectree.score(X_test, y_test))
print()
print(metrics.classification_report(y_test,y_test_pred, digits = 5,zero_division = 0))
print()
print("Kappa Score for Test Set:\t")
print(cohen_kappa_score(y_test,y_test_pred))

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
plt.show()
Goodness of Fit of Model 	Train Dataset
Classification Accuracy 	: 0.5879873405475627

              precision    recall  f1-score   support

           1    0.00000   0.00000   0.00000      7304
           2    0.00000   0.00000   0.00000      2237
           3    0.00000   0.00000   0.00000      6029
           4    0.00000   0.00000   0.00000     14242
           5    0.58799   1.00000   0.74054     42545

    accuracy                        0.58799     72357
   macro avg    0.11760   0.20000   0.14811     72357
weighted avg    0.34573   0.58799   0.43543     72357


Kappa Score for Train Set:	
0.0

Goodness of Fit of Model 	Test Dataset
Classification Accuracy 	: 0.5927277250300593

Accuracy  :	 0.5927277250300593

              precision    recall  f1-score   support

           1    0.00000   0.00000   0.00000      2391
           2    0.00000   0.00000   0.00000       758
           3    0.00000   0.00000   0.00000      1974
           4    0.00000   0.00000   0.00000      4700
           5    0.59273   1.00000   0.74429     14296

    accuracy                        0.59273     24119
   macro avg    0.11855   0.20000   0.14886     24119
weighted avg    0.35133   0.59273   0.44116     24119


Kappa Score for Test Set:	
0.0

Here, our kappa score value is somewhat comparable to what we obtained for the time difference predictor.Therefore I can infer that the review score is affected by the Is_late predictor. Same as the time_difference univariate model, this model only predicts 0 and 3 review scores only

In [42]:
# Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(12,12))
plot_tree(dectree, filled=True, rounded=True, 
          feature_names=["TIME_DIFF"], 
          class_names=["1","2","3","4","5"])
plt.show()

PREDICTING REVIEW SCORE WITH UNI-VARIATE MODELS

The ones with the highest Classification Accuracy for both train and test:

  • IS_LATE
  • TIME_DIFFERENCE

The worst preditor for both train and test:

  • PRICE and FRIGHT_VALUE

Eventhough price is the worst predictor, the Classifcation Accuracy of all the models are pretty close to each other. Hence we still cannot decide which model is the best out of the four.

6.5 MULTI-VARIATE CLASSIFICATION TREE

On the data, I will now fit a multivariate decision tree classifier.

In [43]:
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, cohen_kappa_score
from sklearn import metrics

y = pd.DataFrame(data_clean['REVIEW_SCORE'])   # Response
X = pd.DataFrame(data_clean[['PRICE','FREIGHT_VALUE','TIME_DIFF','IS_LATE']])       # Predictor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 3,  random_state =142)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model

y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

# Check the Goodness of Fit (on Train Data)
print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()
print(metrics.classification_report(y_train,y_train_pred, digits = 5,zero_division = 0))
print()
print("Kappa Score for Train Set:\t")
print(cohen_kappa_score(y_train,y_train_pred))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()
print("Accuracy  :\t", dectree.score(X_test, y_test))
print()
print(metrics.classification_report(y_test,y_test_pred, digits = 5,zero_division = 0))
print()
print("Kappa Score for Test Set:\t")
print(cohen_kappa_score(y_test,y_test_pred))

# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])
plt.show()
Goodness of Fit of Model 	Train Dataset
Classification Accuracy 	: 0.6171068452257557

              precision    recall  f1-score   support

           1    0.63382   0.33644   0.43956      7285
           2    0.00000   0.00000   0.00000      2249
           3    0.00000   0.00000   0.00000      6035
           4    0.00000   0.00000   0.00000     14153
           5    0.61616   0.98982   0.75952     42635

    accuracy                        0.61711     72357
   macro avg    0.25000   0.26525   0.23982     72357
weighted avg    0.42688   0.61711   0.49179     72357


Kappa Score for Train Set:	
0.1235707422667619

Goodness of Fit of Model 	Test Dataset
Classification Accuracy 	: 0.6175629172022057

Accuracy  :	 0.6175629172022057

              precision    recall  f1-score   support

           1    0.63636   0.34274   0.44552      2410
           2    0.00000   0.00000   0.00000       746
           3    0.00000   0.00000   0.00000      1968
           4    0.00000   0.00000   0.00000      4789
           5    0.61649   0.99036   0.75993     14206

    accuracy                        0.61756     24119
   macro avg    0.25057   0.26662   0.24109     24119
weighted avg    0.42670   0.61756   0.49211     24119


Kappa Score for Test Set:	
0.12550644921024967

Although the kappa value is acceptable, the confusion matrix indicates that this is a poor model. The model can only forecast review scores of 4 and 0. Although this model's accuracy is somewhat above average, we cannot declare that it is a good one for making predictions.

In [44]:
# Plot the trained Decision Tree
from sklearn.tree import plot_tree

f = plt.figure(figsize=(12,12))
plot_tree(dectree, filled=True, rounded=True, 
          feature_names=X_train.columns,
          class_names=["1","2","3","4","5"])
plt.show()

7. Conclusion

According to the analysis of the Exploratory Data in the study, delivery time has the biggest impact on customer happiness. Exploratory Data analysis of the time diff and is late shows that the majority of customers will give a lower score, which makes the majority of customers experience the delivery time. This research also used a random forest classifier predicting model, and it provided a poor accuracy score. Then I went on to review the results using a decision tree classifier with uni-variate models. The models demonstrated that the is Late and TIME DIFFERNCE as well as the kappa score have the highest Classification Accuracy for both the train and the test. This further demonstrates how delivery timing has the biggest impact. This also proves that delivery time is the most affect customer satisfaction

8. References and Resources

Harper c.(2018,July 27).Visualizing data with Bokeh and pandas[Online].Available https://programminghistorian.org/en/lessons/visualizing-with-bokeh

Jake VandaerPLas(2016,Nov),Python Data Science Handbook,Combining Datasets: Merge and Join[Online].Available https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html

Varun(2022)Pandas : Change data type of single or multiple columns of Dataframe in Python[Online].Available https://thispointer.com/pandas-change-data-type-of-single-or-multiple-columns-of-dataframe-in-python/

Nimit Vanawat(2021,August 12)How To Perform Exploratory Data Analysis -A Guide for Beginners[Online].Available https://www.analyticsvidhya.com/blog/2021/08/how-to-perform-exploratory-data-analysis-a-guide-for-beginners/

Nikolay Oskolkov(2021,Jan 7)Univariate vs. Multivariate Prediction[Online].Available https://towardsdatascience.com/univariate-vs-multivariate-prediction-c1a6fb3e009

What is an Ecommerce Business https://www.shopify.com/blog/ecommerce-business-blueprint

Ren, S., Choi, T.M., Lee, K.M. and Lin, L., 2020. Intelligent service capacity allocation for cross-border-E-commerce related third-party-forwarding logistics operations: A deep learning approach. Transportation Research Part E: Logistics and Transportation Review, 134, p.101834.

Nanduri, J., Jia, Y., Oka, A., Beaver, J. and Liu, Y.W., 2020. Microsoft uses machine learning and optimization to reduce e-commerce fraud. INFORMS Journal on Applied Analytics, 50(1), pp.64-79.

In [ ]:
 
In [ ]: